Please, read my outlines for the data visualization course in the the follow README file here and delves into learning about data visualization.
When I started coding for biology I realize on this amazing challenge about how to tell the history from a bunch of Next Generation sequencing datasets. Visualization of information (from massive data mining in special) become in a nice part of my data scientist training.
We will use high throughput sequencing dataset from a marine non model organism exposed to hidrocarbon polutant. The libraries sequenced were acqured from undifferented (Un) and sexual differented stages ( Male and female). In this experiment time 0 and three corresponded to hidrocarbon pollutant expossition (before and after, respectivily).
Libraries were preprocessing using the standar parameters within trimmomatic and the assembly were performed with trinity. Differential gene expression and annotation were perform with edgeR (an R package) and Trinotate, respectivily.
The chunk-code here correspond to my workflow used in the lab. Hope you enjoy it!
From your work directory clone the follow repo git clone https://github.com/RJEGR/July_2018_bioinfo.git. Example:
mkdir July_2018_bioinfo
cd July_2018_bioinfo
git clone https://github.com/RJEGR/July_2018_bioinfo.git
# library(devtools)
# devtools::install_github("rlbarter/superheat")
# devtools::install_github("slowkow/ggrepel")
# install_github("cstubben/trinotateR")
# devtools::install_github("ropensci/taxa")
# devtools::install_github("grunwaldlab/metacoder")
# if (!require("DT")) install.packages('DT')
# source("https://bioconductor.org/biocLite.R"); biocLite("org.Hs.eg.db"); # (~ 74.3 MB)
# biocLite("ggtree")
# biocLite('phyloseq')
# if(!require(devtools)) install.packages("devtools")
# devtools::install_github("kassambara/ggpubr")
#
dir <- c("/Users/cigom/Documents/GitHub/July_2018_bioinfo/infile/")
# setwd(dir)
x <- read.table(paste0(dir,"diffExpr.P1e-2_C1.matrix_control"))
And read the input file; then let’s make a head(x) from the file:
| T0F3 | T0F4 | T0M1 | T0M4 | T0UR | T0U | |
|---|---|---|---|---|---|---|
| TRINITY_DN29981_c0_g1_i1 | 158.162 | 152.974 | 0.537 | 0.000 | 0.000 | 0.000 |
| TRINITY_DN25178_c2_g1_i1 | 0.337 | 0.181 | 167.292 | 100.095 | 140.610 | 925.271 |
| TRINITY_DN23469_c2_g2_i1 | 0.000 | 0.000 | 24.217 | 0.000 | 312.915 | 3789.992 |
| TRINITY_DN27384_c1_g1_i1 | 54.064 | 150.841 | 0.000 | 0.000 | 0.000 | 0.095 |
| TRINITY_DN32136_c1_g1_i2 | 0.000 | 0.000 | 34.300 | 3.078 | 48.656 | 93.654 |
| TRINITY_DN24700_c1_g1_i3 | 0.000 | 0.000 | 2.190 | 0.000 | 50.995 | 2477.124 |
Let’s log2 transform the data. Why to use the log2 transformation of the normalized count table ? (please read an answer on a researchgate question here)
data = x # restore before doing various data transformations
data = log2(data+1)
data = as.matrix(data) # convert to matrix
data = t(scale(t(data), scale=F)) # Centering rows
Using superheat
library(superheat)
superheat(data,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
row.dendrogram = TRUE,
left.label = "none",
bottom.label.text.angle = 0,
row.title = "Differential Expressed",
column.title = "Samples")
example 1
Also use the option membership.cols to group the superheat accordint to any factor, for example: factor <- c(rep("Female",2), rep("Male",2), rep("Undiff",2))
factor <- c(rep("Female",2), rep("Male",2), rep("Undiff",2))
superheat(data,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
row.dendrogram = TRUE,
left.label = "none",
row.title = "Differential Expressed",
column.title = "Factor",
membership.cols = factor)
example 2
After process the differential gene expression analysis (Ex. running the run_DE_analysis.pl from Trinity framework) we can improve the data visualization as follow:
DE <- read.table(paste0(dir,"RSEM.isoform.counts.matrix.Female_vs_Undiff.edgeR.DE_results"))
library(ggplot2)
library(scales)
p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) + geom_point()
p
This is a Volcano plot
Let’s label by Fold Change (up/down) rate and significancy by the follow condition:
# logf>abs(2) fdr < 0.05 fdr < 0.05 and logfc> abs(2)
DE$Sig <- "Non Sig or basal"
DE$Sig[(abs(DE$logFC) > 2)] <- "logFC"
DE$Sig[(DE$FDR<0.05) & (abs(DE$logFC)>2)] <- "logFC_FDR"
p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) +
geom_point(aes(color = Sig)) + theme_classic() +
scale_color_brewer()
p
Volcano; Color and fill
Add lines and axis name:
p1 <- p +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
geom_hline(yintercept = -log10(0.0001), linetype = "dashed") +
geom_vline(xintercept = c(-2, 2), linetype = "dashed")
And also let’s rename x and y axis using backquote macros:
p1 + xlab(bquote(~Log[2]~ "fold change")) + ylab(bquote(~-Log[10]~italic(P)))
Volcano lines and axis labeled
# p2 <- p + xlab("Fold change (log2)") + ylab("P-Value")
Finally, let’s label the dots from the scatter:
library(ggrepel)
maxlab <- max(-log10(DE$PValue)) - 1 # select the points below the highest -log10(PValue) value to label
p2 <- p1 + geom_text_repel(
data = subset(DE, -log10(PValue) > maxlab),
aes(label = Sig),
size = 2.5,
box.padding = unit(0.2, "lines"),
point.padding = unit(0.2, "lines")
)
Finally, let’s compare the layouts:
library(gridExtra)
library(grid)
grid_arrange_shared_legend(p,p1,p2, ncol = 3)
Volcano comparison
After assembly denovo or guided transcriptome is common to map each contig to a reference in order to annotate the potential source of each transcript. In this view, blast2go is the average tool used by users to perform this analysis nevertheles blast2go is a non free tool. In spite of, Trinotate is a useful free-framework to this purpose. Trinotate makes use of a number of different well referenced methods for functional annotation including homology search to known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and leveraging various annotation databases (eggNOG/GO/Kegg databases).
In personal experience, trinotate had resulted in better performance (such less computational time) than the blast2go counterpart and useful to automatically annotate one workflow at time.
In addition, trinotateR is package developed by Chris Stubben with useful functions to “wrangling” the tab-delimited Trinotate.xls result.
library(trinotateR)
## Loading required package: data.table
y <- read_trinotate(paste0(dir,"Trinotate.xls"))
knitr::kable(summary_trinotate(y))
| unique | total | |
|---|---|---|
| gene_id | 61694 | 148534 |
| transcript_id | 147454 | 148534 |
| prot_id | 55785 | 55785 |
| prot_coords | 24521 | 55785 |
| sprot_Top_BLASTX_hit | 43800 | 45692 |
| gene_ontology_blast | 14000 | 44709 |
| Kegg | 15112 | 39430 |
| eggnog | 5651 | 37535 |
| sprot_Top_BLASTP_hit | 28519 | 35660 |
| Pfam | 26217 | 34005 |
| gene_ontology_pfam | 1775 | 20111 |
| TmHMM | 7764 | 10393 |
| RNAMMER | 9 | 9 |
| SignalP | 0 | 0 |
| transcript | 0 | 0 |
| peptide | 0 | 0 |
Most of the annotations contain mutliple hits in a backtick-delimited list and each hit contains multiple fields in a caret-delimited list. For example, the second Pfam annotation below contains two hits and each hit contains a pfam id, symbol, name, alignment and e-value. The split_pfam functions splits multiple hits and fields, so the second Pfam annotation is now printed in rows 2 and 3 below.
y1 <- split_pfam(y)
## 70856 Pfam annotations
head(y1,3)[,c(2,4:7)]
## transcript pfam symbol
## 1: TRINITY_DN10004_c0_g1_i1 PF01442 Apolipoprotein
## 2: TRINITY_DN10005_c0_g1_i1 PF16489 GAIN
## 3: TRINITY_DN10007_c0_g1_i1 PF00135 COesterase
## name align
## 1: Apolipoprotein A1/A4/E domain 20-67
## 2: GPCR-Autoproteolysis INducing (GAIN) domain 17-121
## 3: Carboxylesterase family 3-85
Finally, the summary_pfam lists both, the number of unique Pfam identifiers and the total genes, transcripts and proteins with a Pfam annotation.
y2 <- summary_pfam(y1)
## 4849 rows
knitr::kable(head(y2[order(-y2$transcripts),]))
| pfam | symbol | name | genes | transcripts | proteins | total |
|---|---|---|---|---|---|---|
| PF00386 | C1q | C1q domain | 201 | 801 | 809 | 856 |
| PF00069 | Pkinase | Protein kinase domain | 279 | 594 | 594 | 611 |
| PF07714 | Pkinase_Tyr | Protein tyrosine kinase | 272 | 583 | 583 | 596 |
| PF00400 | WD40 | WD domain, G-beta repeat | 203 | 461 | 461 | 1351 |
| PF12796 | Ank_2 | Ankyrin repeats (3 copies) | 219 | 455 | 455 | 925 |
| PF00023 | Ank | Ankyrin repeat | 214 | 449 | 449 | 1031 |
The summary table also includes a count attribute with the number of unique genes, transcripts and proteins with a Pfam annotation, as well as the total number of annotations. In this example, there are 33,721 unique transcripts with a Pfam annotation and 56,642 total annotations to transcripts (since those may have more than one Pfam annotation).
c <- attr(y2, "count")
print(c)
## unique annotations
## Pfam 4849 70856
## genes 14417 25280
## transcripts 33721 56642
## proteins 34005 56757
Implement datatable to present interactive data-table: from all the available genes/transcripts annotations. The R package DT is an R interface to the JavaScript library DataTable. It package use R data objects (matrices or data frames) to display it as tables on HTML pages. This object-class could include filtering, pagination, sorting, and many other features in the tables.
z <- data.frame(y2)
z$pfam <- paste0('<a href="http://pfam.xfam.org/family/', z$pfam, '">', z$pfam, '</a>')
library(DT)
datatable(z , escape=1, options = list( pageLength = 10 ) )
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
trinotateR to get the Gene Ontology annotation.A GO annotation is a statement about the function of a particular gene. Each GO annotation consists of an association between a gene and a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. The GO describes function with respect to three aspects: molecular function (molecular-level activities performed by gene products), cellular component (the locations relative to cellular structures in which a gene product performs a function), and biological process (the larger processes, or ‘biological programs’ accomplished by multiple molecular activities)
Reference: http://www.geneontology.org/page/ontology-documentation, 2018
go <- split_GO(y)
## 473221 gene_ontology_blast annotations
gos <- summary_GO(go)
## 15386 rows
knitr::kable(head(gos[order(-gos$transcripts),]))
| go | ontology | name | genes | transcripts | proteins | total |
|---|---|---|---|---|---|---|
| GO:0005737 | cellular_component | cytoplasm | 5087 | 10521 | 8155 | 10550 |
| GO:0005634 | cellular_component | nucleus | 4286 | 8687 | 6947 | 8725 |
| GO:0016021 | cellular_component | integral component of membrane | 3839 | 7817 | 6020 | 7832 |
| GO:0005829 | cellular_component | cytosol | 3508 | 7300 | 5854 | 7325 |
| GO:0005886 | cellular_component | plasma membrane | 3366 | 6305 | 4961 | 6315 |
| GO:0046872 | molecular_function | metal ion binding | 2771 | 5815 | 4421 | 5829 |
And also determine the counts
cgo <- attr(gos, "count")
print(cgo)
## unique annotations
## GO 15386 473221
## genes 19818 241902
## transcripts 44415 472268
## proteins 34463 379068
Using only the translated transcripts to get the differential Expressed annotations:
got <- na.omit(go, cols = "protein")
x$transcript <- rownames(x)
m <- merge(got, x, suffix = c("transcript"), all=FALSE)
m <- m[order(m$transcript),c(1,4,5:12)]
library(DT)
datatable(m , filter = 'top', escape=1, options = list( pageLength = 10 ),
rownames = FALSE)
Let’s draw some visualizations from the annotation enrichment throught a the transcriptome assembly using the package ggpubr:
library(ggpubr)
## Loading required package: magrittr
plotgos <- head(gos[order(-gos$transcripts),], 80)
ggbarplot(plotgos, "name", "transcripts",
fill = "ontology",
color = "ontology",
palette = c("#00AFBB", "#E7B800", "#FC4E07")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7))
Gene Ontology enrichment plot (top 80 transcripts)
# facet_grid(ontology ~ .,) + theme(#strip.text.x = element_blank(),
# axis.text.x = element_text(angle = 90, hjust = 1))
And separate the ontology annotation for further analysis:
| Var1 | Freq |
|---|---|
| biological_process | 10236 |
| cellular_component | 1571 |
| molecular_function | 3579 |
You can separete the go terms to perfom further test:
MF <- gos[gos$ontology=="molecular_function",]
CC <- gos[gos$ontology=="cellular_component",]
BP <- gos[gos$ontology=="biological_process",]
Determine the similarity of two GO terms based on the annotation statistics of their common ancestor terms by computing semantic similarity among GO terms, sets of GO terms, gene products, and gene clusters, providing different methods than measure the information content (IC). To details please read the Wang’s paper published in Oxford.
First, build annotation data needed by GOSemSim via godata function. Based in figure from Gene Ontology Enrichment, we could focus on the Cellular component (CC) terms.
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:data.table':
##
## first, second
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:data.table':
##
## shift
##
hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
##
## preparing gene to GO mapping data...
go <- as.vector(CC$go)
go1 <- sample(go, 20)
go2 <- sample(go, 20)
gosim <- GOSemSim::mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
And visualize the similarity of the GO term.
superheat(gosim,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
#left.label = "none",
bottom.label.text.angle = 90,
row.title = "Sample 1",
column.title = "Sample 2",
bottom.label.text.size = 4,
left.label.text.size = 4
)
Finally detache (unload) org.Hs.eg.db package.
detach(package:org.Hs.eg.db, unload = TRUE)
From the annotation, lets use the blastx result in order to get the lineage from proteins anntotated using a modified function from trinotateR
split_blast2 <- function (x, hit = "sprot_Top_BLASTX_hit")
{
y <- x[!is.na(get(hit)), .(get(hit), gene_id, transcript_id,
prot_id)]
z <- strsplit(y$V1, "`")
n <- sapply(z, length)
z <- strsplit(unlist(z), "\\^")
if (any(sapply(z, "[", 1) != sapply(z, "[", 2)))
print("WARNING: check different values in columns 1 and 2")
NAME <- gsub("^RecName: Full=", "", sapply(z, "[", 6))
NAME <- gsub("SubName: Full=", "", NAME)
NAME <- gsub(";$", "", NAME)
NAME <- gsub(" \\{[^}]+}", "", NAME)
x1 <- data.frame(gene = rep(y$gene_id, n), transcript = rep(y$transcript_id,
n), protein = rep(gsub(".*\\|", "", y$prot_id), n), uniprot = sapply(z,
"[", 1), align = sapply(z, "[", 3), identity = as.numeric(gsub("%ID",
"", sapply(z, "[", 4))), evalue = as.numeric(gsub("E:",
"", sapply(z, "[", 5))), name = NAME, lineage = sapply(z,
"[", 7), domain = gsub("; .*", "", sapply(z, "[", 7)),
genus = gsub(".*; ", "", sapply(z, "[", 7)), stringsAsFactors = FALSE)
message(nrow(x1), " ", hit, " annotations")
data.table(x1)
}
blast <- split_blast2(y)
## 63540 sprot_Top_BLASTX_hit annotations
blast2 <- summary_blast(blast)
## 17039 rows
knitr::kable(head(blast2[order(-blast2$transcripts),]))
| uniprot | domain | genus | name | genes | transcripts | proteins | total |
|---|---|---|---|---|---|---|---|
| HS12B_HUMAN | Eukaryota | Homo | Heat shock 70 kDa protein 12B | 60 | 136 | 121 | 146 |
| C1QL4_MOUSE | Eukaryota | Mus | Complement C1q-like protein 4 | 47 | 122 | 106 | 124 |
| HS12A_MOUSE | Eukaryota | Mus | Heat shock 70 kDa protein 12A | 51 | 113 | 90 | 116 |
| HS12A_HUMAN | Eukaryota | Homo | Heat shock 70 kDa protein 12A | 53 | 99 | 64 | 99 |
| PLCL_MYTGA | Eukaryota | Mytilus | Perlucin-like protein | 38 | 98 | 69 | 98 |
| PLC_HALLA | Eukaryota | Haliotis | Perlucin | 34 | 97 | 81 | 100 |
Do you think we can re-annotate with the Gene ontology the Differential Expression data-set (DE object) and display a new annotated volcano plot as in the fist session ?
DE$transcript <- rownames(DE)
DEannot <- merge(DE, blast, suffix = c("transcript"), all=FALSE)
p1 + geom_text_repel(
data = subset(DEannot, -log10(PValue) > 7),
aes(label = name),
size = 2.5,
box.padding = unit(0.2, "lines"),
point.padding = unit(0.2, "lines")
)
This is a Volcano plot annotated
Creates nice looking color palettes (discrete). Visit the follow cookbook to details
library(RColorBrewer)
display.brewer.all()
# display.brewer.pal(n = 5, name = 'Set3') # Hexadecimal color specification
Could we color with orange discrete scale the
gosimobject usingsuperheat?. superheat could perform this by using the optionheat.paloption.
superheat(gosim,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
#left.label = "none",
bottom.label.text.angle = 90,
row.title = "Sample 1",
column.title = "Sample 2",
bottom.label.text.size = 4,
left.label.text.size = 4,
heat.pal = brewer.pal(n = 9, name = "Oranges"))
Visualizing hierarchical information
Metacoder provides an alternative visulization we call heat trees to plot hierarchical information, in special the organism abundance data classified by a taxonomy.
Let’s subset the virus lineage annotation:
knitr::kable(table(blast$domain))
| Var1 | Freq |
|---|---|
| . | 16 |
| Archaea | 137 |
| Bacteria | 1245 |
| Eukaryota | 61817 |
| Viruses | 325 |
vs <- blast[blast$domain == "Viruses",]
list <- vs$lineage
And plot …
library(metacoder)
## Loading required package: taxa
##
## Attaching package: 'taxa'
## The following object is masked from 'package:ggplot2':
##
## map_data
## This is metacoder verison 0.2.1.9008 (development version). If you use metacoder for published research, please cite our paper:
##
## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404
##
## Enter `citation("metacoder")` for a BibTeX entry for this citation.
##
## Attaching package: 'metacoder'
## The following object is masked from 'package:IRanges':
##
## reverse
library(RColorBrewer)
obj <- parse_tax_data(vs, class_cols = "lineage", class_sep = ";")
heat_tree(obj,
node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
#edge_color = "grey",
node_color_range = brewer.pal(n = 10, name = "Oranges"))
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
Tree example using metacoder
Which annotation could be true, let’s show the identity ?
heat_tree(obj,
edge_label = taxon_names,
edge_label_size = 0.5,
node_color = identity,
edge_color = "grey",
node_color_range = brewer.pal(n = 10, name = "Oranges"),
node_size_axis_label = "n_obs",
node_color_axis_label = "Identity")
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
## Warning: NAs found in the "node_color" option. These may cause plotting errors or other strange behavior. NAs can be created if there is not a value for each taxon. The following 28 of 48 taxa have NAs for the "node_color" option:
## ab, ac, ad, ae, af, ag, ah, ai ... ba, bb, bh, bj, bm, bn, bo
Tree example 2; labeling by identity
The ggtree is designed by extending the ggplot2 (Wickham 2009) package. It is based on the grammar of graphics and takes all the good parts of ggplot2. ggtree supports several layouts, including rectangular, slanted, circular and fan for phylogram and cladogram, equal_angle and daylight for unrooted layout, time-scaled and two dimentional phylogenies.
Reference: Yu et al. 2017, https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12628
tbl <- head(vs[order(-vs$identity),], 50)
tbl <- tbl[!duplicated(tbl$transcript),] # remove duplicates annotations
write.table(tbl$transcript, file = paste0(dir, "virus.list"), quote = FALSE, col.names = FALSE, row.names = FALSE)
Run in clustalo web-service and input again in r the tree this is a Neighbour-joining tree without distance corrections.
tree <- treeio::read.newick(paste0(dir,"virus.tree.corrected.txt"))
And plot
library(ggtree)
## ggtree v1.12.0 For help: https://guangchuangyu.github.io/software/ggtree
##
## If you use ggtree in published research, please cite:
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:ggpubr':
##
## rotate
## The following object is masked from 'package:magrittr':
##
## inset
multiplot(
ggtree(tree, branch.length = "none") + geom_treescale(width=0.4),
ggtree(tree, branch.length='none', layout="daylight") + geom_treescale(width=0.4) + geom_nodepoint(),
ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4),
ncol=3,labels = LETTERS[1:3])
## Average angle change [1] 0.099133807614635
## Average angle change [2] 0.0415093787405126
ggtree; using different layouts
Then,
p <- ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4) +
geom_nodepoint(color="#b5e521", alpha=1/4, size=4)
Modify tips and node geometry
multiplot(# add node points
p + geom_nodepoint(),
# add tip points
p + geom_tippoint(),
# Label the tips
p + geom_tiplab(),
ncol=3,labels = LETTERS[1:3])
ggtree; using different layouts
Use annotation information to merge within object (ie. tree and dataframe)
d <- p + geom_nodepoint()
tbl <- tbl[,-c(1)]
colnames(tbl)[1] <- "label"
tbl$genus <- gsub("unclassified ", "", tbl$genus)
t1 <- d %<+% tbl +
geom_point(aes(color=genus), size=5, alpha=.5, na.rm = TRUE) + theme(legend.position="bottom")
t2 <- p %<+% tbl +
geom_tiplab(size=2.5, aes(label=paste0('italic(', uniprot, ')')), parse=TRUE)
multiplot(t1, t2, ncol=2)
ggtree; labeling tree based on data-frame
This part of the course was cloned from a current version tutorial from https://joey711.github.io and is reproduced to demonstrate the simpleness of this tool
otus = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
rownames(otus) <- paste0("OTU", 1:nrow(otus)) # name the rows with a label of name OTU#
colnames(otus) <- paste0("Sample", 1:ncol(otus)) # name the columns with a label of name Sample#
knitr::kable(otus)
| Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | Sample7 | Sample8 | Sample9 | Sample10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| OTU1 | 69 | 29 | 17 | 22 | 93 | 8 | 39 | 3 | 36 | 87 |
| OTU2 | 62 | 32 | 37 | 65 | 74 | 46 | 51 | 48 | 66 | 1 |
| OTU3 | 25 | 4 | 85 | 35 | 98 | 8 | 93 | 94 | 67 | 37 |
| OTU4 | 54 | 46 | 3 | 87 | 54 | 81 | 47 | 54 | 94 | 72 |
| OTU5 | 62 | 90 | 67 | 40 | 60 | 77 | 89 | 27 | 53 | 44 |
| OTU6 | 73 | 34 | 24 | 66 | 67 | 93 | 48 | 27 | 78 | 4 |
| OTU7 | 49 | 93 | 22 | 78 | 11 | 38 | 93 | 14 | 90 | 83 |
| OTU8 | 1 | 19 | 83 | 30 | 23 | 92 | 5 | 52 | 30 | 54 |
| OTU9 | 51 | 24 | 55 | 70 | 60 | 49 | 97 | 86 | 26 | 12 |
| OTU10 | 60 | 70 | 9 | 1 | 40 | 74 | 55 | 61 | 41 | 97 |
Let’s make a tax table
taxmat = matrix(sample(letters, 70, replace = TRUE),
nrow = nrow(otus),
ncol = 7)
rownames(taxmat) <- rownames(otus) # The rownames must match the OTU names (taxa_names) of the otus
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
knitr::kable(taxmat)
| Domain | Phylum | Class | Order | Family | Genus | Species | |
|---|---|---|---|---|---|---|---|
| OTU1 | f | v | r | p | g | z | z |
| OTU2 | d | q | g | p | x | w | c |
| OTU3 | m | a | i | v | i | l | t |
| OTU4 | m | a | b | o | u | l | q |
| OTU5 | l | u | e | b | s | x | t |
| OTU6 | m | r | x | n | x | r | m |
| OTU7 | g | w | i | k | f | q | j |
| OTU8 | c | v | j | h | m | j | r |
| OTU9 | i | u | l | n | w | x | p |
| OTU10 | w | y | o | y | m | q | q |
Note than both, taxmat and otus, are objects of class matrix
class(taxmat)
## [1] "matrix"
class(otus)
## [1] "matrix"
….
library("phyloseq")
##
## Attaching package: 'phyloseq'
## The following object is masked from 'package:taxa':
##
## filter_taxa
## The following object is masked from 'package:IRanges':
##
## distance
## The following object is masked from 'package:Biobase':
##
## sampleNames
OTU = otu_table(otus, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
physeq = phyloseq(OTU, TAX)
print(physeq)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 10 samples ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
Let’s visualize some barplots:
theme_set( theme_classic())
bar <- plot_bar(physeq, fill = "Genus")
# Color with scale_fill_brewer
bar + geom_bar(stat="identity") +
scale_fill_brewer(palette="Spectral")
phyloseq; barplot 1
Create random sample data, and add that to the combined dataset. Make sure that the sample names match the sample_names of the otu_table.
sampledata = data.frame(
Location = sample(LETTERS[1:4], size=nsamples(physeq), replace=TRUE),
Depth = sample(50:1000, size=nsamples(physeq), replace=TRUE),
row.names=sample_names(physeq),
stringsAsFactors=FALSE
)
knitr::kable(sampledata)
| Location | Depth | |
|---|---|---|
| Sample1 | A | 683 |
| Sample2 | C | 251 |
| Sample3 | A | 668 |
| Sample4 | A | 755 |
| Sample5 | C | 354 |
| Sample6 | C | 986 |
| Sample7 | D | 620 |
| Sample8 | B | 182 |
| Sample9 | D | 277 |
| Sample10 | A | 56 |
sampledata <- sample_data(sampledata) # save as phyloseq-object
Finally, merge sampledata within phyloseq-object
physeq1 = merge_phyloseq(physeq, sampledata)
print(physeq1)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 10 samples ]
## sample_data() Sample Data: [ 10 samples by 2 sample variables ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
And improve the plots. Keep the same fill color, and group the samples together by the Location variable (essentially the environment from which the sample was taken and sequenced) described in the sample_data(phyloseq1):
plot_bar(physeq1, x="Location", fill="Genus") +
geom_bar(stat="identity") +
scale_fill_brewer(palette="Spectral")
phyloseq; barplot 2
Also we can organize the information according to any factor using facets. Let’s use the Location in order to group the plots, this time xlab will be labeled with Family name, and the fill of the bars will be colored by Genus level.
plot_bar(physeq1, "Family", fill="Genus", facet_grid=~Location) +
geom_bar(stat="identity") +
scale_fill_brewer(palette="Spectral")
barplot with facet_grid
You can improve you barpots as phyloseq authors sugguest here
Also we could implement tree information associated to the physeq1 object. Let’s create a random phylogenetic tree with the ape package, and add it to your dataset. Make sure its tip labels match your OTU_table.
library("ape")
##
## Attaching package: 'ape'
## The following object is masked from 'package:ggtree':
##
## rotate
## The following object is masked from 'package:metacoder':
##
## complement
## The following object is masked from 'package:ggpubr':
##
## rotate
rtree = rtree(ntaxa(physeq1), rooted=TRUE, tip.label=taxa_names(physeq1))
Great, lets merge the tree in the physoleq- object
physeq1 = merge_phyloseq(physeq1, rtree)
physeq1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 10 samples ]
## sample_data() Sample Data: [ 10 samples by 2 sample variables ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10 tips and 9 internal nodes ]
Also you can make a denovo phyloseq-object construction from the scratch by: physeq1 = phyloseq(OTU, TAX, sampledata, rtree).
plot_tree(physeq1,
color="Location",
label.tips="taxa_names",
size="abundance",
ladderize="left",
plot.margin=0.3)
phyloseq; tree with location labels
plot_tree(physeq1,
color="Depth",
shape="Location",
label.tips="taxa_names",
ladderize="right",
plot.margin=0.3)
phyloseq; tree with location labels 2
plot_heatmap(physeq1)
plot heatmap from phyloseq-objects
Improve the heatmap by adding the taxa names in the ylab and order the xlab by the ecological distance using an ordination method.
h <- plot_heatmap(physeq1, low="#66CCFF", high="#000033",
taxa.label="Family",
method = "NMDS", distance = "bray"
)
print(h)
plot heatmap from phyloseq-objects using ordination and taxa names
We can use the ggplot2 functions to make a final arragement of the heatmap
h + facet_grid(Class ~ . + Location, scales = "free", space = "free")
and dotplot
dp <- phyloseq::psmelt(physeq1)
library(ggpubr)
ggdotchart(dp, x = "Sample", y ="Abundance",
ggtheme = theme_bw())
and
ggdotchart(dp, x = "Sample", y ="Abundance",
group = "Location", color = "Domain",
palette = "jco",
rotate = TRUE,
sorting = "descending",
ggtheme = theme_bw(),
y.text.col = TRUE,
dot.size = "Abundance")
See the multiqc_report.html in the infile folder. If you’re interested in do this in-lab-plots, please visit the brief pre-processing step documentation.
This tutorial were performed using the R session:
print(sessionInfo())
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ape_5.1 phyloseq_1.24.2 ggtree_1.12.0
## [4] metacoder_0.2.1.9008 taxa_0.2.1.9009 RColorBrewer_1.1-2
## [7] AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3
## [10] Biobase_2.40.0 BiocGenerics_0.26.0 ggpubr_0.1.6
## [13] magrittr_1.5 DT_0.4 trinotateR_1.0
## [16] data.table_1.11.4 gridExtra_2.3 ggrepel_0.8.0
## [19] scales_0.5.0 ggplot2_3.0.0 bindrcpp_0.2.2
## [22] superheat_1.0.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bit64_0.9-7 ggsci_2.9
## [4] rprojroot_1.3-2 tools_3.5.0 backports_1.1.2
## [7] vegan_2.5-2 R6_2.2.2 mgcv_1.8-23
## [10] DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2
## [13] permute_0.9-4 ade4_1.7-11 withr_2.1.2
## [16] tidyselect_0.2.4 bit_1.1-14 compiler_3.5.0
## [19] cli_1.0.0 ggdendro_0.1-20 labeling_0.3
## [22] stringr_1.3.1 digest_0.6.15 rmarkdown_1.10
## [25] XVector_0.20.0 pkgconfig_2.0.1 htmltools_0.3.6
## [28] GA_3.1.1 highr_0.7 htmlwidgets_1.2
## [31] rlang_0.2.1 RSQLite_2.1.1 shiny_1.1.0
## [34] bindr_0.1.1 jsonlite_1.5 crosstalk_1.0.0
## [37] GOSemSim_2.6.0 dplyr_0.7.6 GO.db_3.6.0
## [40] Matrix_1.2-14 biomformat_1.8.0 Rhdf5lib_1.2.1
## [43] Rcpp_0.12.18 munsell_0.5.0 ggfittext_0.6.0
## [46] stringi_1.2.4 yaml_2.1.19 MASS_7.3-49
## [49] zlibbioc_1.26.0 rhdf5_2.24.0 plyr_1.8.4
## [52] blob_1.1.1 promises_1.0.1 crayon_1.3.4
## [55] lattice_0.20-35 splines_3.5.0 Biostrings_2.48.0
## [58] multtest_2.36.0 knitr_1.20 pillar_1.2.3
## [61] igraph_1.2.2 reshape2_1.4.3 codetools_0.2-15
## [64] glue_1.2.0 evaluate_0.10.1 treeio_1.4.1
## [67] httpuv_1.4.4.1 foreach_1.4.4 gtable_0.2.0
## [70] purrr_0.2.5 tidyr_0.8.1 assertthat_0.2.0
## [73] mime_0.5 xtable_1.8-2 tidytree_0.1.9
## [76] later_0.7.3 survival_2.41-3 tibble_1.4.2
## [79] iterators_1.0.10 rvcheck_0.1.0 memoise_1.1.0
## [82] cluster_2.0.7-1